\chapter{Methodology}
\ifpdf
    \graphicspath{{Chapter3/Chapter3Figs/PNG/}{Chapter3/Chapter3Figs/PDF/}{Chapter3/Chapter3Figs/}}
\else
    \graphicspath{{Chapter3/Chapter3Figs/EPS/}{Chapter3/Chapter3Figs/}}
\fi

For each Wikipedia article, all revisions were extracted and analyzed. The revisions that marked the entry of a new contributor were separated out, which yielded discrete levels of interaction. For example, the knowledge contained in the revision corresponding to the $k^{th}$ new user represents the collaboration and interaction of the previous $k - 1$ users. To quantify the knowledge in a particular version of the article, we followed the approaches described below.
\section{Trivial Approach}
%\markboth{\MakeUppercase{\thechapter. Methodology }}{\thechapter. Methodology}
Knowledge accumulation can be related directly to the quality of an article. The most basic measure of article quality is the number of words it contains: the more detailed and descriptive an article is, the richer its knowledge content, as noted in [7].\\
In this approach, we maintained a word count for each discretized revision, and this count served as the value of the page at that revision. The count excluded stopwords, that is, the most commonly used words such as prepositions, articles and conjunctions. Punctuation marks and numeric values were also removed, and revisions showing evident or suspected vandalism were excluded.\\

Let $D_{0}$ be the set of discretized revisions and $V_{0}$ be the set of revisions suspected of vandalism. Then
\begin{displaymath}
D = D_{0} \setminus V_{0}
\end{displaymath}
where $D$ is the required set of discretized revisions.
Let $W_{i}$ denote the set of unique words contained in the $i^{th}$ revision in $D$, and let $W_{ij}$ represent the $j^{th}$ unique word in the $i^{th}$ revision. The value of revision $i$ is then
\begin{displaymath}
val(i) = |W_{i} |
\end{displaymath}
The value function \textit{val} is calculated for every revision in $D$.
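
As a small illustration of the definitions above, consider hypothetical data that are not taken from the study: five discretized revisions of which the third is suspected of vandalism, and a retained revision whose text is ``The moon orbits the Earth in 27 days.'' Assuming that ``the'' and ``in'' are on the stopword list, the set difference and the word count might work out as follows.

```latex
\begin{displaymath}
D_{0} = \{d_{1}, d_{2}, d_{3}, d_{4}, d_{5}\}, \quad V_{0} = \{d_{3}\} \quad \Rightarrow \quad D = D_{0} \setminus V_{0} = \{d_{1}, d_{2}, d_{4}, d_{5}\}
\end{displaymath}
\begin{displaymath}
W_{i} = \{\mathrm{moon}, \mathrm{orbits}, \mathrm{earth}, \mathrm{days}\} \quad \Rightarrow \quad val(i) = |W_{i}| = 4
\end{displaymath}
```

In this sketch the stopwords, the number 27 and the final period do not contribute to the count.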

 

\section{Age Approach}
The trivial approach described above has a major drawback: it does not capture the acceptance of knowledge by the community. For instance, the words in a discretized revision could have just been added and not yet been reviewed. To overcome this drawback, we give more weight to those words in the revision that have survived a larger number of reviews. We call this weight the \textbf{age of the word}. This concept was first presented in [16].

Let a revision $i$ in $D$ be the $k^{th}$ revision of the article, and suppose that the word $W_{ij}$ has passed $r_{ij}$ revisions. Clearly,
$r_{ij} \leq k$. Then
\begin{displaymath}
val(i) = \sum\limits_{j= 1}^{|W_{i}|} r_{i,j} \leq k \cdot |W_{i}|
\end{displaymath}
Equality holds only if every word in $W_{i}$ has survived all $k$ revisions, that is, if no new word has been added since the first revision.
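
To make the bound concrete, the following sketch uses hypothetical numbers that are not drawn from the data: a revision that is the $k = 10^{th}$ revision of the article and contains three unique words whose ages are assumed to be $9$, $4$ and $1$.

```latex
\begin{displaymath}
\sum\limits_{j = 1}^{3} r_{i,j} = 9 + 4 + 1 = 14 \leq k \cdot |W_{i}| = 10 \cdot 3 = 30
\end{displaymath}
```

The gap between $14$ and $30$ reflects the two words that entered the article late; under the trivial approach the same revision would simply be worth $|W_{i}| = 3$.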


\section{Frequency - Age Approach}

So far, we have dealt only with unique words in the article. However, a word may occur multiple times in one or more revisions. We associate the recurrence of a word in the article with its importance. Note that commonly occurring words were already removed before evaluation, so a word has to be important enough to be repeated in the article.\\
Let the frequency of the word $W_{ij}$ be $f_{ij}$. The value function for the revision becomes:
\begin{displaymath}
val(i) = \sum\limits_{j= 1}^{|W_{i}|} r_{i,j} \cdot f_{i,j}
\end{displaymath}
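
As a worked illustration with hypothetical numbers, assume a revision with three unique words whose ages are $9$, $4$ and $1$ and whose frequencies are $4$, $1$ and $9$ respectively. The age-weighted, frequency-weighted sum would then be:

```latex
\begin{displaymath}
\sum\limits_{j = 1}^{3} r_{i,j} \cdot f_{i,j} = 9 \cdot 4 + 4 \cdot 1 + 1 \cdot 9 = 36 + 4 + 9 = 49
\end{displaymath}
```

Note that the oldest word dominates the total, while the newest word still contributes noticeably because it is repeated nine times.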

\section{Root of Frequency Approach}

In the previous section, we discussed the role of the frequency of a word in its importance. However, not every occurrence of a word has passed the same number of revisions; in other words, not every occurrence has been scrutinized by the same number of reviewers. To balance this, we multiply the square root of the frequency of each word by its age and sum the products to obtain the value of the revision. Again, the values of the discretized revisions are plotted against the number of revisions.
\begin{displaymath}
val(i) = \sum\limits_{j = 1}^{|W_{i}|} r_{i,j} \cdot \sqrt{f_{i,j}}
\end{displaymath}
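
The damping effect of the square root can be seen in a small sketch with hypothetical numbers: three unique words with assumed ages $9$, $4$ and $1$ and assumed frequencies $4$, $1$ and $9$.

```latex
\begin{displaymath}
\sum\limits_{j = 1}^{3} r_{i,j} \cdot \sqrt{f_{i,j}} = 9 \cdot 2 + 4 \cdot 1 + 1 \cdot 3 = 18 + 4 + 3 = 25
\end{displaymath}
```

For comparison, weighting the same ages by the undamped frequencies would give $9 \cdot 4 + 4 \cdot 1 + 1 \cdot 9 = 49$, so heavily repeated words count for less under this metric.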

\section{Root of Age Approach}
As discussed in the Root of Frequency Approach, the occurrences of a word in a revision do not all have the same age. If the word was first added in, say, the $i^{th}$ revision, and its second occurrence appeared in, say, the $(i+7)^{th}$ revision, the two occurrences should have different ages. If we treated the two occurrences as two different words, it would be very difficult to identify which occurrence to delete when one of them is removed from the page. To simplify, we use the square root of the age in the metric.
\begin{displaymath}
val(i) = \sum\limits_{j = 1}^{|W_{i}|} \sqrt{r_{i,j}} \cdot f_{i,j}
\end{displaymath}
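
A brief worked example, again with hypothetical numbers, may help: assume three unique words with ages $9$, $4$ and $1$ and frequencies $4$, $1$ and $9$.

```latex
\begin{displaymath}
\sum\limits_{j = 1}^{3} \sqrt{r_{i,j}} \cdot f_{i,j} = 3 \cdot 4 + 2 \cdot 1 + 1 \cdot 9 = 12 + 2 + 9 = 23
\end{displaymath}
```

Here it is the age, rather than the frequency, that is damped, so a word that has been in the article for nine revisions is weighted only three times as heavily as a word that has just been added.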

\section{Average Frequency Approach}
The metrics above use the square roots of the frequency and the age as stand-ins for their average values. To gain higher accuracy, we kept track of the new occurrences of each word and updated the averages accordingly. At every step, the average frequency equals the product of the average frequency at the previous step and the previous age of the word, plus the current frequency, all divided by the new age.
\begin{displaymath}
val(i) = \sum\limits_{j= 1}^{|W_{i}|} r_{i,j} \cdot Avg(f_{i,j})
\end{displaymath}
\begin{displaymath}
Avg(f_{i,j}) = \frac {Avg(f_{i-1,j}) \cdot r_{i-1,j} + f_{i,j} } {r_{i,j} }
\end{displaymath}
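
One step of the running-average update can be traced with hypothetical numbers. Assume that a word had an average frequency of $2$ at age $3$ in the previous discretized revision, and that in the current revision it occurs $6$ times at age $4$.

```latex
\begin{displaymath}
Avg(f_{i,j}) = \frac{2 \cdot 3 + 6}{4} = \frac{12}{4} = 3
\end{displaymath}
\begin{displaymath}
r_{i,j} \cdot Avg(f_{i,j}) = 4 \cdot 3 = 12
\end{displaymath}
```

The second line is the contribution of this single word to the value of the revision; the sudden jump to six occurrences is averaged against the earlier, lower frequencies rather than being counted in full.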

\section{Average Age Approach}
The need for an average age of a word was explained in the Root of Age Approach. To obtain a more accurate approximation of the average age of a word, we use the following formulas:
\begin{displaymath}
val(i) = \sum\limits_{j= 1}^{|W_{i}|} Avg(r_{i,j}) \cdot f_{i,j}
\end{displaymath}
\begin{displaymath}
Avg(r_{i,j}) = \frac {(Avg(r_{i-1,j}) \cdot f_{i-1,j}) + (f_{i,j} - f_{i-1, j} ) } {f_{i,j} }
\end{displaymath}
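
The update for the average age can be traced in the same way. As a sketch with hypothetical numbers, assume that a word occurred $2$ times with an average age of $4$ in the previous discretized revision and occurs $3$ times in the current one.

```latex
\begin{displaymath}
Avg(r_{i,j}) = \frac{(4 \cdot 2) + (3 - 2)}{3} = \frac{9}{3} = 3
\end{displaymath}
\begin{displaymath}
Avg(r_{i,j}) \cdot f_{i,j} = 3 \cdot 3 = 9
\end{displaymath}
```

Under the formula as written, the single new occurrence adds $1$ to the numerator, which pulls the average age of the word down from $4$ to $3$.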

% ------------------------------------------------------------------------


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "../thesis"
%%% End: 
